在本文中,我们提出了基于抑制增强面膜的注意力和交互式通道转换(Semicon),以学习处理大规模细粒图像检索任务的二进制哈希码。在半号中,我们首先开发出基于抑制增强的面膜(SEM)的注意力,以动态定位判别图像区域。更重要的是,与现有的注意机制不同,我们的SEM是为了限制此类区域而开发的,然后通过考虑以阶段的方式考虑激活区域之间的关系来限制其他互补区域。在每个阶段,交互式通道变换(ICON)模块之后旨在利用跨参与激活张量的通道之间的相关性。由于通道通常可以与细粒对象的部分相对应,因此也可以相应地建模该部分相关性,从而进一步提高细粒的检索精度。此外,要作为计算经济,图标是通过有效的两步过程实现的。最后,对我们的分号的哈希学习由全球和本地级分支组成,以更好地表示细粒对象,然后生成与多个级别相对应的二进制哈希码。五个基准细粒数据集的实验显示了我们优于竞争方法。
translated by 谷歌翻译
半监督的几次学习在于培训分类器以适应有限的标记数据和固定数量未标记的数据的新任务。已经开发了许多复杂的方法来解决该问题所包含的挑战。在本文中,我们提出了一种简单但相当有效的方法,可以从间接学习的角度预测未标记数据的准确伪标记,然后增强在几个拍摄分类任务中设置的极其标签受限的支持。我们的方法只能通过仅使用现成的操作来仅在几行代码中实现,但是它能够在四个基准数据集上超越最先进的方法。
translated by 谷歌翻译
由于课程中的训练样本极端不平衡,长尾实例分割是一个具有挑战性的任务。它导致头部课程的严重偏差(含有多数样本)对尾尾。这呈现“如何适当地定义和缓解偏见”最重要的问题之一。先前作品主要使用标签分布或平均分数信息来表示粗粒偏置。在本文中,我们探索挖掘困难的矩阵,该矩阵携带细粒度的错误分类细节,以减轻成对偏置,概括粗液。为此,我们提出了一种新颖的成对类余额(PCB)方法,基于混淆矩阵,在训练期间更新以累积正在进行的预测偏好。 PCB在培训期间生成正规化的纠错软标签。此外,开发了一种迭代学习范例,以支持这种脱结的渐进和平稳的正则化。 PCB可以插入并播放任何现有方法作为补充。 LVIS的实验结果表明,我们的方法在没有钟声和口哨的情况下实现最先进的性能。各种架构的卓越结果表明了泛化能力。
translated by 谷歌翻译
细粒度的图像分析(FGIA)是计算机视觉和模式识别中的长期和基本问题,并为一组多种现实世界应用提供了基础。 FGIA的任务是从属类别分析视觉物体,例如汽车或汽车型号的种类。细粒度分析中固有的小阶级和阶级阶级内变异使其成为一个具有挑战性的问题。利用深度学习的进步,近年来,我们在深入学习动力的FGIA中见证了显着进展。在本文中,我们对这些进展的系统进行了系统的调查,我们试图通过巩固两个基本的细粒度研究领域 - 细粒度的图像识别和细粒度的图像检索来重新定义和扩大FGIA领域。此外,我们还审查了FGIA的其他关键问题,例如公开可用的基准数据集和相关域的特定于应用程序。我们通过突出几个研究方向和开放问题,从社区中突出了几个研究方向和开放问题。
translated by 谷歌翻译
Our work focuses on tackling the challenging but natural visual recognition task of long-tailed data distribution (i.e., a few classes occupy most of the data, while most classes have rarely few samples). In the literature, class re-balancing strategies (e.g., re-weighting and re-sampling) are the prominent and effective methods proposed to alleviate the extreme imbalance for dealing with long-tailed problems. In this paper, we firstly discover that these rebalancing methods achieving satisfactory recognition accuracy owe to that they could significantly promote the classifier learning of deep networks. However, at the same time, they will unexpectedly damage the representative ability of the learned deep features to some extent. Therefore, we propose a unified Bilateral-Branch Network (BBN) to take care of both representation learning and classifier learning simultaneously, where each branch does perform its own duty separately. In particular, our BBN model is further equipped with a novel cumulative learning strategy, which is designed to first learn the universal patterns and then pay attention to the tail data gradually. Extensive experiments on four benchmark datasets, including the large-scale iNaturalist ones, justify that the proposed BBN can significantly outperform state-of-the-art methods. Furthermore, validation experiments can demonstrate both our preliminary discovery and effectiveness of tailored designs in BBN for long-tailed problems. Our method won the first place in the iNaturalist 2019 large scale species classification competition, and our code is open-source and available at https://github.com/Megvii-Nanjing/BBN . * Q. Cui and Z.-M. Chen's contribution was made when they were interns in Megvii Research Nanjing, Megvii Technology, China. X.
translated by 谷歌翻译
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc, revealing that: 1) Distilling token relations is more effective than CLS token- and feature-based distillation; 2) An intermediate layer of the teacher network as target perform better than that using the last layer when the depth of the student mismatches that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over the scratch MIM pre-training on ImageNet-1K classification, using all the ViT-Tiny, ViT-Small, and ViT-base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU in AE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way for developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
translated by 谷歌翻译
Given the increasingly intricate forms of partial differential equations (PDEs) in physics and related fields, computationally solving PDEs without analytic solutions inevitably suffers from the trade-off between accuracy and efficiency. Recent advances in neural operators, a kind of mesh-independent neural-network-based PDE solvers, have suggested the dawn of overcoming this challenge. In this emerging direction, Koopman neural operator (KNO) is a representative demonstration and outperforms other state-of-the-art alternatives in terms of accuracy and efficiency. Here we present KoopmanLab, a self-contained and user-friendly PyTorch module of the Koopman neural operator family for solving partial differential equations. Beyond the original version of KNO, we develop multiple new variants of KNO based on different neural network architectures to improve the general applicability of our module. These variants are validated by mesh-independent and long-term prediction experiments implemented on representative PDEs (e.g., the Navier-Stokes equation and the Bateman-Burgers equation) and ERA5 (i.e., one of the largest high-resolution data sets of global-scale climate fields). These demonstrations suggest the potential of KoopmanLab to be considered in diverse applications of partial differential equations.
translated by 谷歌翻译
In this chapter, we review and discuss the transformation of AI technology in HCI/UX work and assess how AI technology will change how we do the work. We first discuss how AI can be used to enhance the result of user research and design evaluation. We then discuss how AI technology can be used to enhance HCI/UX design. Finally, we discuss how AI-enabled capabilities can improve UX when users interact with computing systems, applications, and services.
translated by 谷歌翻译
Adversarial robustness assessment for video recognition models has raised concerns owing to their wide applications on safety-critical tasks. Compared with images, videos have much high dimension, which brings huge computational costs when generating adversarial videos. This is especially serious for the query-based black-box attacks where gradient estimation for the threat models is usually utilized, and high dimensions will lead to a large number of queries. To mitigate this issue, we propose to simultaneously eliminate the temporal and spatial redundancy within the video to achieve an effective and efficient gradient estimation on the reduced searching space, and thus query number could decrease. To implement this idea, we design the novel Adversarial spatial-temporal Focus (AstFocus) attack on videos, which performs attacks on the simultaneously focused key frames and key regions from the inter-frames and intra-frames in the video. AstFocus attack is based on the cooperative Multi-Agent Reinforcement Learning (MARL) framework. One agent is responsible for selecting key frames, and another agent is responsible for selecting key regions. These two agents are jointly trained by the common rewards received from the black-box threat models to perform a cooperative prediction. By continuously querying, the reduced searching space composed of key frames and key regions is becoming precise, and the whole query number becomes less than that on the original video. Extensive experiments on four mainstream video recognition models and three widely used action recognition datasets demonstrate that the proposed AstFocus attack outperforms the SOTA methods, which is prevenient in fooling rate, query number, time, and perturbation magnitude at the same.
translated by 谷歌翻译
Reading comprehension of legal text can be a particularly challenging task due to the length and complexity of legal clauses and a shortage of expert-annotated datasets. To address this challenge, we introduce the Merger Agreement Understanding Dataset (MAUD), an expert-annotated reading comprehension dataset based on the American Bar Association's 2021 Public Target Deal Points Study, with over 39,000 examples and over 47,000 total annotations. Our fine-tuned Transformer baselines show promising results, with models performing well above random on most questions. However, on a large subset of questions, there is still room for significant improvement. As the only expert-annotated merger agreement dataset, MAUD is valuable as a benchmark for both the legal profession and the NLP community.
translated by 谷歌翻译